p1Data <- read.csv("../data/p1Data.csv", header=TRUE)
p2Data <- read.csv("../data/p2Data.csv", header=TRUE)
p3Data <- read.csv("../data/p3Data.csv", header=TRUE)

Part 1 - Recovering a True Model (Beginner Level)

We believe the true model to be… (found after doing all of the work below…)

\[ Y_i = \underbrace{\beta_0 + \beta_1 X4_i + \beta_2 X2_i X4_i}_\text{The True Model} + \epsilon_i \]

And estimate that model by…

\[ \hat{Y}_i = -0.7065 + 2.3958 X4_i + 1.7763 X2_i X4_i \]

with our estimate of \(\sigma\) as … 13.11 (the residual standard error of the final model).

pairs(p1Data)

Looking at the pairs plot above, X4, X2, and perhaps even X1 show some structure with Y.

Let’s begin with X4.

plot(Y ~ X4, data=p1Data)

lm1 <- lm(Y ~ X4, data=p1Data)
summary(lm1)
## 
## Call:
## lm(formula = Y ~ X4, data = p1Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.401 -20.814  -8.875  25.900  73.082 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.8909    13.6788  -0.211    0.834    
## X4            3.0458     0.4205   7.243 3.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.61 on 48 degrees of freedom
## Multiple R-squared:  0.5222, Adjusted R-squared:  0.5123 
## F-statistic: 52.47 on 1 and 48 DF,  p-value: 3.117e-09
pairs(cbind(R=lm1$res,Fit=lm1$fitted.values,p1Data))

In the plot above, the R vs. Fit panel shows clear structure, meaning something beyond X4 is still at work. It looks like X2 could offer further insight.

plot(Y ~ X4, data=p1Data, col=as.factor(X2))

lm2 <- lm(Y ~ X4 + X2, data=p1Data)
summary(lm2)
## 
## Call:
## lm(formula = Y ~ X4 + X2, data = p1Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.668  -9.683   0.259   9.193  36.577 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.7850     6.8608  -2.592   0.0127 *  
## X4            2.9463     0.2077  14.184  < 2e-16 ***
## X2           56.1377     4.5838  12.247 3.13e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.11 on 47 degrees of freedom
## Multiple R-squared:  0.886,  Adjusted R-squared:  0.8812 
## F-statistic: 182.7 on 2 and 47 DF,  p-value: < 2.2e-16
pairs(cbind(R=lm2$res,Fit=lm2$fitted.values,p1Data), col=as.factor(p1Data$X2))

Let’s check for an interaction between X2 and X4.

plot(Y ~ X4, data=p1Data, col=as.factor(X2))

lm3 <- lm(Y ~ X4 + X2 + X4:X2, data=p1Data)
summary(lm3)
## 
## Call:
## lm(formula = Y ~ X4 + X2 + X4:X2, data = p1Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.814  -7.849   1.644   6.957  22.187 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.0620     7.0550  -0.434 0.666306    
## X4            2.4649     0.2185  11.281 7.68e-15 ***
## X2            7.8272    12.8606   0.609 0.545768    
## X4:X2         1.5495     0.3920   3.953 0.000264 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.19 on 46 degrees of freedom
## Multiple R-squared:  0.9149, Adjusted R-squared:  0.9094 
## F-statistic: 164.9 on 3 and 46 DF,  p-value: < 2.2e-16

The R-squared is much improved, but the X2 main effect is no longer significant. How does the model look if we remove the X2 term?

lm4 <- lm(Y ~ X4 + X4:X2, data=p1Data)
summary(lm4)
## 
## Call:
## lm(formula = Y ~ X4 + X4:X2, data = p1Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.490  -8.209   2.623   6.654  21.950 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.7065     5.8591  -0.121    0.905    
## X4            2.3958     0.1854  12.920   <2e-16 ***
## X4:X2         1.7763     0.1212  14.656   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.11 on 47 degrees of freedom
## Multiple R-squared:  0.9142, Adjusted R-squared:  0.9106 
## F-statistic: 250.5 on 2 and 47 DF,  p-value: < 2.2e-16
pairs(cbind(R=lm4$res,Fit=lm4$fitted.values,p1Data), col=as.factor(p1Data$X2))

I like it. Let’s end the search.
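A partial F-test makes the drop-X2 decision formal. Here is a hedged sketch on simulated data with the same structure as the recovered model (assumed values, not the actual p1Data); with the real data, `anova(lm4, lm3)` performs the same comparison.

```r
# Simulated data mimicking the recovered model (assumption, not p1Data)
set.seed(42)
n  <- 50
X4 <- runif(n, 10, 50)
X2 <- rbinom(n, 1, 0.5)
Y  <- -0.7 + 2.4 * X4 + 1.8 * X2 * X4 + rnorm(n, sd = 13)

fit_full    <- lm(Y ~ X4 + X2 + X4:X2)  # analogue of lm3
fit_reduced <- lm(Y ~ X4 + X4:X2)       # analogue of lm4
anova(fit_reduced, fit_full)            # F-test: does the X2 main effect help?
```

A large p-value on the one-degree-of-freedom F-test supports dropping the X2 main effect.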

par(mfrow=c(1,3))
plot(lm4, which=1:2)
plot(lm4$res)

What if we had taken a wrong turn and fit a quartic in X4 instead of the X2:X4 interaction model?

Maybe not a wrong turn after all… the R-squared is slightly better, though adding polynomial terms always raises the raw R-squared, so that alone is weak evidence.

lmOther <- lm(Y ~ X4 + X2 + I(X4^2) + I(X4^3) + I(X4^4), data=p1Data)
summary(lmOther)
## 
## Call:
## lm(formula = Y ~ X4 + X2 + I(X4^2) + I(X4^3) + I(X4^4), data = p1Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.2953 -10.6030   0.6181   8.2609  26.8756 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.332e+02  1.083e+02   3.998 0.000241 ***
## X4          -7.329e+01  1.773e+01  -4.134 0.000158 ***
## X2           5.511e+01  4.006e+00  13.757  < 2e-16 ***
## I(X4^2)      4.449e+00  1.011e+00   4.400 6.79e-05 ***
## I(X4^3)     -1.076e-01  2.413e-02  -4.461 5.59e-05 ***
## I(X4^4)      9.212e-04  2.051e-04   4.491 5.07e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.9 on 44 degrees of freedom
## Multiple R-squared:  0.9222, Adjusted R-squared:  0.9134 
## F-statistic: 104.4 on 5 and 44 DF,  p-value: < 2.2e-16
plot(Y ~ X4, data=p1Data, col=factor(X2))
b <- coef(lmOther)
x2 <- 0
curve(b[1] + b[2]*x + b[3]*x2 + b[4]*x^2 + b[5]*x^3 + b[6]*x^4, add=TRUE, col="black")
x2 <- 1
curve(b[1] + b[2]*x + b[3]*x2 + b[4]*x^2 + b[5]*x^3 + b[6]*x^4, add=TRUE, col="red")
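Because the interaction model and the quartic model are not nested, a partial F-test does not apply; an information criterion such as AIC offers a hedged way to compare them. A self-contained sketch on simulated data (assumed structure, not the real p1Data) — with the actual fits, `AIC(lm4, lmOther)` makes the same comparison:

```r
# Simulated data mimicking the recovered model (assumption, not p1Data)
set.seed(1)
n  <- 50
X4 <- runif(n, 10, 50)
X2 <- rbinom(n, 1, 0.5)
Y  <- -0.7 + 2.4 * X4 + 1.8 * X2 * X4 + rnorm(n, sd = 13)

fit_int   <- lm(Y ~ X4 + X4:X2)                             # interaction model
fit_quart <- lm(Y ~ X4 + X2 + I(X4^2) + I(X4^3) + I(X4^4))  # quartic model
AIC(fit_int, fit_quart)  # lower AIC preferred; penalizes the extra terms
```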

Part 2 - Recovering a True Model (Intermediate Level)

We believe the true model to be… (found after all of the work below…)

\[ Y_i = \underbrace{\beta_0 + \beta_1 X5_i + \beta_2 X5^2_i + \beta_3 X3_i}_\text{The True Model} + \epsilon_i \]

And estimate that model by…

\[ \hat{Y}_i = 3.76 + 8.29 X5_i + 1.69 X5^2_i + 2.39 X3_i \]

with our estimate of \(\sigma\) as … 1.647 (the residual standard error of the final model).

pairs(p2Data)

It’s quite clear that X5 with a quadratic term is needed.

lm1 <- lm(Y ~ X5, data=p2Data)
pairs(cbind(R=lm1$res,Fit=lm1$fitted.values,p2Data))

lm2 <- lm(Y ~ X5 + I(X5^2), data=p2Data)
summary(lm2)
## 
## Call:
## lm(formula = Y ~ X5 + I(X5^2), data = p2Data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -25.79  -9.40  -1.11  11.24  25.86 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 217.7607     2.8734   75.79   <2e-16 ***
## X5            7.3966     0.5724   12.92   <2e-16 ***
## I(X5^2)       1.7358     0.1155   15.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.97 on 47 degrees of freedom
## Multiple R-squared:  0.9275, Adjusted R-squared:  0.9244 
## F-statistic: 300.7 on 2 and 47 DF,  p-value: < 2.2e-16
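As an aside, the same quadratic fit can be written with orthogonal polynomials via `poly()`, which avoids the strong collinearity between X5 and X5^2; the fitted values are identical either way. A self-contained sketch on simulated data (assumed values, not p2Data):

```r
# Simulated quadratic data (assumption, not p2Data)
set.seed(7)
n  <- 50
X5 <- runif(n, -10, 10)
Y  <- 218 + 7.4 * X5 + 1.7 * X5^2 + rnorm(n, sd = 15)

fit_raw  <- lm(Y ~ X5 + I(X5^2))  # raw polynomial, as in lm2 above
fit_orth <- lm(Y ~ poly(X5, 2))   # orthogonal polynomial parameterization
all.equal(fitted(fit_raw), fitted(fit_orth))  # same fit, different coefficients
```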

Even though we already have a great model, one last check reveals a surprise: X3 would add a great deal more information to the model.

pairs(cbind(R=lm2$res,Fit=lm2$fitted.values,p2Data))

Add X3.

lm3 <- lm(Y ~ X5 + I(X5^2) + X3, data=p2Data)
summary(lm3)
## 
## Call:
## lm(formula = Y ~ X5 + I(X5^2) + X3, data = p2Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5216 -1.3482 -0.2922  0.6103  3.9575 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.76394    3.47034   1.085    0.284    
## X5           8.28866    0.06462 128.273   <2e-16 ***
## I(X5^2)      1.69025    0.01274 132.715   <2e-16 ***
## X3           2.39367    0.03866  61.922   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.647 on 46 degrees of freedom
## Multiple R-squared:  0.9991, Adjusted R-squared:  0.9991 
## F-statistic: 1.783e+04 on 3 and 46 DF,  p-value: < 2.2e-16
pairs(cbind(R=lm3$res,Fit=lm3$fitted.values,p2Data))

Looks like we are done. However, just to try a few more quick ideas…

pairs(cbind(R=lm3$res,Fit=lm3$fitted.values,p2Data), col=as.factor(p2Data$X2))

pairs(cbind(R=lm3$res,Fit=lm3$fitted.values,p2Data), col=as.factor(p2Data$X7))

Does X7 add anything?

summary(lm(Y ~ X5 + I(X5^2) + X3 + factor(X7), data=p2Data))
## 
## Call:
## lm(formula = Y ~ X5 + I(X5^2) + X3 + factor(X7), data = p2Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6599 -1.1951 -0.3341  0.6076  4.2183 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.02813    3.67585   0.824    0.415    
## X5           8.29788    0.06724 123.405   <2e-16 ***
## I(X5^2)      1.69050    0.01385 122.096   <2e-16 ***
## X3           2.39888    0.04010  59.830   <2e-16 ***
## factor(X7)2  0.21807    0.62696   0.348    0.730    
## factor(X7)3  0.53704    0.58548   0.917    0.364    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.668 on 44 degrees of freedom
## Multiple R-squared:  0.9992, Adjusted R-squared:  0.9991 
## F-statistic: 1.043e+04 on 5 and 44 DF,  p-value: < 2.2e-16
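Individual dummy-coefficient t-tests can miss a factor whose levels matter jointly, so it is worth testing both X7 dummies at once with a partial F-test. A self-contained sketch on simulated data (assumed structure, not p2Data); with the real data, `anova(lm3, lm(Y ~ X5 + I(X5^2) + X3 + factor(X7), data=p2Data))` does the same.

```r
# Simulated data mimicking the Part 2 model, with X7 truly inert (assumption)
set.seed(3)
n  <- 50
X5 <- runif(n, -10, 10)
X3 <- runif(n, 50, 100)
X7 <- rep(1:3, length.out = n)
Y  <- 3.8 + 8.3 * X5 + 1.7 * X5^2 + 2.4 * X3 + rnorm(n, sd = 1.6)

small <- lm(Y ~ X5 + I(X5^2) + X3)
big   <- lm(Y ~ X5 + I(X5^2) + X3 + factor(X7))
anova(small, big)  # joint F-test of both X7 dummies at once
```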

Nothing. Leave it alone.

Part 3 - Recovering a True Model (Advanced Level)

We believe the true model to be…

\[ Y_i = \underbrace{\beta_0 + \beta_1 X6_i + \beta_2 X8_i + \beta_3 X15_i + \beta_4 X6_i X8_i + \beta_5 X6_i X15_i + \beta_6 X8_i X15_i + \beta_7 X6_i X8_i X15_i}_\text{The True Model} + \epsilon_i \]

And estimate that model by…

\[ \hat{Y}_i = -3.712 + 1.956 X6_i + 2.63 X8_i + 9.59 X15_i + 1.557 X6_i X8_i - 4.522 X6_i X15_i -5.12 X8_i X15_i + 1.82 X6_i X8_i X15_i \]

with our estimate of \(\sigma\) as … 0.8219 (the residual standard error).

Here is how we came up with that guess…

# Using ```{r, fig.height=10, fig.width=10} in the R chunk header lets us zoom in on this pairs plot when we knit the file.
pairs(p3Data)

Start by using X6 in the model.

lm1 <- lm(Y ~ X6, data=p3Data)
summary(lm1)
## 
## Call:
## lm(formula = Y ~ X6, data = p3Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0156 -1.6855  0.0478  1.3605  4.7334 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.2779     0.4265   2.997  0.00431 ** 
## X6            4.4754     0.6031   7.421 1.67e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.132 on 48 degrees of freedom
## Multiple R-squared:  0.5343, Adjusted R-squared:  0.5246 
## F-statistic: 55.07 on 1 and 48 DF,  p-value: 1.67e-09

Color by X6.

pairs(cbind(R=lm1$res,Fit=lm1$fitted.values,p3Data), col=as.factor(p3Data$X6))

Looks like X8 matters through an interaction with X6.

lm2 <- lm(Y ~ X6 + X8 + X6:X8, data=p3Data)
summary(lm2)
## 
## Call:
## lm(formula = Y ~ X6 + X8 + X6:X8, data = p3Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7979 -1.1900  0.0022  1.1824  3.3366 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   0.2060     1.0423   0.198  0.84421   
## X6           -0.3060     1.5588  -0.196  0.84522   
## X8            0.5580     0.5127   1.088  0.28214   
## X6:X8         2.4298     0.7633   3.183  0.00261 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.705 on 46 degrees of freedom
## Multiple R-squared:  0.7148, Adjusted R-squared:  0.6962 
## F-statistic: 38.42 on 3 and 46 DF,  p-value: 1.383e-12

Drop X6?

lm3 <- lm(Y ~ X8 + X6:X8, data=p3Data)
summary(lm3)
## 
## Call:
## lm(formula = Y ~ X8 + X6:X8, data = p3Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8394 -1.2030 -0.0126  1.1410  3.4114 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.06915    0.76706   0.090    0.929    
## X8           0.62158    0.39329   1.580    0.121    
## X8:X6        2.28735    0.23377   9.785 6.43e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.687 on 47 degrees of freedom
## Multiple R-squared:  0.7145, Adjusted R-squared:  0.7024 
## F-statistic: 58.82 on 2 and 47 DF,  p-value: 1.607e-13

Drop X8?

lm4 <- lm(Y ~ X6:X8, data=p3Data)
summary(lm4)
## 
## Call:
## lm(formula = Y ~ X6:X8, data = p3Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0347 -1.3120  0.1014  1.0504  3.0518 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.1682     0.3287   3.554 0.000864 ***
## X6:X8         2.3965     0.2268  10.567 4.04e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.713 on 48 degrees of freedom
## Multiple R-squared:  0.6993, Adjusted R-squared:  0.6931 
## F-statistic: 111.7 on 1 and 48 DF,  p-value: 4.041e-14
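Since we dropped two main effects one at a time, a joint F-test of both dropped terms is a sensible cross-check. A hedged sketch on simulated data (assumed structure, not the real p3Data); with the actual fits, `anova(lm4, lm2)` gives the same comparison.

```r
# Simulated interaction-only data (assumption, not p3Data)
set.seed(11)
n  <- 50
X6 <- rbinom(n, 1, 0.5)
X8 <- runif(n, 0, 5)
Y  <- 1.2 + 2.4 * X6 * X8 + rnorm(n, sd = 1.7)

full    <- lm(Y ~ X6 + X8 + X6:X8)  # analogue of lm2
reduced <- lm(Y ~ X6:X8)            # analogue of lm4
anova(reduced, full)  # joint test of both dropped main effects
```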

Hmm… the X6:X8 interaction alone isn’t quite capturing everything that is happening inside of X8.

pairs(cbind(R=lm4$res,Fit=lm4$fitted.values,p3Data), col=as.factor(p3Data$X6))

Let’s try coloring by some of the other discrete variables, like X15 (looks promising!) and X18 (not so useful).

pairs(cbind(R=lm4$res,Fit=lm4$fitted.values,p3Data), col=as.factor(p3Data$X15))

pairs(cbind(R=lm4$res,Fit=lm4$fitted.values,p3Data), col=as.factor(p3Data$X18))

Try adding X15 and X15:X8…

lm5 <- lm(Y ~ X6:X8 + X15 + X15:X8, data=p3Data)
summary(lm5)
## 
## Call:
## lm(formula = Y ~ X6:X8 + X15 + X15:X8, data = p3Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.09785 -0.90274 -0.02559  0.84449  2.98867 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.2314     0.3100   3.972 0.000248 ***
## X15           4.2943     1.2645   3.396 0.001418 ** 
## X6:X8         2.6201     0.1957  13.387  < 2e-16 ***
## X8:X15       -2.4982     0.5741  -4.352 7.45e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.429 on 46 degrees of freedom
## Multiple R-squared:  0.7996, Adjusted R-squared:  0.7866 
## F-statistic: 61.19 on 3 and 46 DF,  p-value: 4.329e-16

Now to draw the model, which is getting fairly involved. With X6, X8, and X15, our regression lives in four-dimensional space (including Y). However, since X6 and X15 each take only two possible values, we can show all three explanatory dimensions in a single 2D plot: X8 on the x-axis, X6 as the color of the dots, and X15 as the plotting character of the dots. The graphic is reasonably interpretable.

plot(Y ~ X8, data=p3Data, pch=c(16,1)[as.factor(X15)], col=as.factor(X6))
b <- coef(lm5)
# Note how the following code replaces X8 with "x" because X8 is being 
# used as the x-axis of the plot.
x15 <- 0 #pch=16
x6 <- 0 #col="black"
curve(b[1] + b[2]*x15 + b[3]*x6*x + b[4]*x*x15, add=TRUE, lty=1, col="black")

x15 <- 0 #pch=16
x6 <- 1 #col="red"
curve(b[1] + b[2]*x15 + b[3]*x6*x + b[4]*x*x15, add=TRUE, lty=1, col="red")

x15 <- 1 #pch=1
x6 <- 0 #col="black"
curve(b[1] + b[2]*x15 + b[3]*x6*x + b[4]*x*x15, add=TRUE, lty=3, col="black")

x15 <- 1 #pch=1
x6 <- 1 #col="red"
curve(b[1] + b[2]*x15 + b[3]*x6*x + b[4]*x*x15, add=TRUE, lty=3, col="red")

Notice how the solid red line and the dotted black line fit their respective data quite well, while the dotted red line and the solid black line do not. This hints that our model is missing some terms. Let’s try the full three-way interaction model of X6, X8, and X15, which, surprisingly, shows all terms significant!

lm6 <- lm(Y ~ X6*X8*X15, data=p3Data)
summary(lm6)
## 
## Call:
## lm(formula = Y ~ X6 * X8 * X15, data = p3Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.73258 -0.44085 -0.00423  0.49823  1.82198 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.7118     0.6517  -5.696 1.09e-06 ***
## X6            1.9561     0.9017   2.169  0.03576 *  
## X8            2.6298     0.3205   8.204 2.91e-10 ***
## X15           9.5904     1.0239   9.367 7.62e-12 ***
## X6:X8         1.5568     0.4552   3.420  0.00141 ** 
## X6:X15       -4.5218     1.9362  -2.335  0.02437 *  
## X8:X15       -5.1217     0.5037 -10.169 6.78e-13 ***
## X6:X8:X15     1.8200     0.8847   2.057  0.04590 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8219 on 42 degrees of freedom
## Multiple R-squared:  0.9395, Adjusted R-squared:  0.9294 
## F-statistic:  93.1 on 7 and 42 DF,  p-value: < 2.2e-16
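Each combination of X6 and X15 defines its own line in X8: the intercept collects the coefficients of terms without X8, and the slope collects those with X8. A small sketch using the (rounded) estimates printed above:

```r
# Coefficients from lm6 above, rounded:
# intercept, X6, X8, X15, X6:X8, X6:X15, X8:X15, X6:X8:X15
b <- c(-3.712, 1.956, 2.630, 9.590, 1.557, -4.522, -5.122, 1.820)

line_for <- function(x6, x15) {
  c(intercept = b[1] + b[2]*x6 + b[4]*x15 + b[6]*x6*x15,
    slope     = b[3] + b[5]*x6 + b[7]*x15 + b[8]*x6*x15)
}
line_for(0, 0)  # baseline group: intercept -3.712, slope 2.630
line_for(1, 1)  # X6 = 1, X15 = 1 group: intercept 3.312, slope 0.885
```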

Now to draw it…

plot(Y ~ X8, data=p3Data, pch=c(16,1)[as.factor(X15)], col=as.factor(X6))
b <- coef(lm6)
b #b has 8 terms in it.
## (Intercept)          X6          X8         X15       X6:X8      X6:X15 
##   -3.711810    1.956099    2.629833    9.590431    1.556800   -4.521770 
##      X8:X15   X6:X8:X15 
##   -5.121697    1.820005
# Note how the following code replaces X8 with "x" because X8 is being 
# used as the x-axis of the plot.

x15 <- 0 #pch=16
x6 <- 0 #col="black"
curve(b[1] + b[2]*x6 + b[3]*x + b[4]*x15 + b[5]*x6*x + b[6]*x6*x15 + b[7]*x*x15 + b[8]*x6*x*x15, add=TRUE, lty=1, col="black")

x15 <- 0 #pch=16
x6 <- 1 #col="red"
curve(b[1] + b[2]*x6 + b[3]*x + b[4]*x15 + b[5]*x6*x + b[6]*x6*x15 + b[7]*x*x15 + b[8]*x6*x*x15, add=TRUE, lty=1, col="red")

x15 <- 1 #pch=1
x6 <- 0 #col="black"
curve(b[1] + b[2]*x6 + b[3]*x + b[4]*x15 + b[5]*x6*x + b[6]*x6*x15 + b[7]*x*x15 + b[8]*x6*x*x15, add=TRUE, lty=3, col="black")

x15 <- 1 #pch=1
x6 <- 1 #col="red"
curve(b[1] + b[2]*x6 + b[3]*x + b[4]*x15 + b[5]*x6*x + b[6]*x6*x15 + b[7]*x*x15 + b[8]*x6*x*x15, add=TRUE, lty=3, col="red")

Now that is fitting quite nicely!

One last check…

pairs(cbind(R=lm6$res,Fit=lm6$fitted.values,p3Data), col=interaction(p3Data$X15,p3Data$X6))

Nothing obvious at this point, and the residuals-vs-fitted plot looks good enough. Let’s call it quits and state our final guess (recorded back at the top of this section).